Using Misclassification Analysis for Data Cleaning

Authors

Piyasak Jeatrakul

Kok Wai Wong

Chun Che Fung

Abstract

Data cleaning is a pre-processing technique used in most data mining problems. Its purpose is to remove noise, inconsistent data and errors in order to obtain a better, more representative data set from which to develop a reliable prediction model. In most prediction models, unclean data can reduce prediction accuracy. In this paper, we investigate classification problems and use a misclassification analysis technique for data cleaning. To demonstrate the concept, we use an artificial neural network (ANN) as the core computational intelligence technique. We apply the proposed data cleaning technique to three benchmark data sets obtained from the University of California Irvine (UCI) machine learning repository. All three are binary classification problems: German credit data, BUPA liver disorders, and Johns Hopkins Ionosphere. The experimental results show that the proposed cleaning technique could be a good alternative to provide some confidence when constructing a classification model.
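The abstract does not spell out the cleaning procedure, so the following is only a minimal sketch of one plausible reading: train a classifier on the training set, flag the training instances it misclassifies, and drop them before building the final model. A single sigmoid neuron trained by batch gradient descent stands in for the ANN here, and the toy data set, function names, and hyperparameters are all assumptions made for illustration; none of them come from the paper.

```python
import math


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(X, y, epochs=2000, lr=0.5):
    """Fit one sigmoid neuron by batch gradient descent (stand-in for an ANN)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(model, xi):
    """Return the predicted class (0 or 1) for one instance."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0.0 else 0


def clean_by_misclassification(X, y):
    """Drop training instances that the trained model misclassifies.

    Hypothetical helper: returns the cleaned data and the removed indices.
    """
    model = train_neuron(X, y)
    removed = [i for i, (xi, yi) in enumerate(zip(X, y))
               if predict(model, xi) != yi]
    dropped = set(removed)
    X_clean = [xi for i, xi in enumerate(X) if i not in dropped]
    y_clean = [yi for i, yi in enumerate(y) if i not in dropped]
    return X_clean, y_clean, removed


# Toy binary data (invented for this sketch): two clusters, plus two
# instances (indices 8 and 9) whose labels were flipped on purpose.
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],
     [1.0, 1.0], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8],
     [0.1, 0.1], [0.9, 0.9]]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

X_clean, y_clean, removed = clean_by_misclassification(X, y)
print("removed indices:", removed)
```

On this toy data the two deliberately mislabelled instances are the ones flagged for removal. In practice the flagged instances would need inspection before deletion, since a misclassified instance may be a genuine hard case and not noise.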

Similar resources

Data Cleaning for Classification Using Misclassification Analysis

In classification problems, data cleaning is sometimes used as a preprocessing technique to achieve better results. The purpose of data cleaning is to remove noise, inconsistent data and errors in the training data. This should enable the use of a better and more representative data set to develop a reliable classification model. In most classification models, unclean data could somet...

Adherence to osteoporosis pharmacotherapy is underestimated using days supply values in electronic pharmacy claims data.

PURPOSE Days supply (prescription duration) values are commonly used to estimate drug exposure and quantify adherence to therapy, yet accuracy is not routinely assessed, and potential inaccurate reporting has been previously identified. We examined the impact of cleaning days supply values on the measurement of adherence to oral bisphosphonates. METHODS We identified new users of oral bisphos...

Binary Regression With a Misclassified Response Variable in Diabetes Data

Objectives: Categorical data analysis is very important in statistics and the medical sciences. When the binary response variable is misclassified, fitting the model yields biased estimates of the adjusted odds ratios. The present study aimed to use a method to detect and correct misclassification error in the response variable of Type 2 Diabetes Mellitus (T2DM), applying binary ...

An application of Measurement error evaluation using latent class analysis

Latent class analysis (LCA) is a method of evaluating nonsampling errors, especially measurement error in categorical data. Biemer (2011) introduced four latent class modeling approaches: probability model parameterization, log linear model, modified path model, and graphical model using path diagrams. These models are interchangeable. Latent class probability models express l...

Analysis of angina pectoris status based on misclassification probabilities of the smoking risk factor in the Tehran Lipid and Glucose Study, 1378-79

Misclassification of disease status and risk factors is one of the main sources of error in studies. Wrong assignment of individuals into exposed and non-exposed groups may seriously distort the results in case-control studies. This study investigates the effect of misclassification error on odds ratio estimates and attempts to introduce a correction method. Data on 3332 men aged 30-69 years fr...

Publication date: 2012
